Data description

We consider data from the US Energy Information Administration’s online data browser.The data was downloaded from Table 2.2 of the US Energy Information Administration’s online data browser (https://www.eia.gov/totalenergy/data/browser/). The data set contains multiple time series, detailing:

The aim of this study is to analyse the monthly residential energy consumption in the US, since January 2000.We will consider two cases:

The data includes 7,896 observations for each month of each year since 1949 with 6 columns as it can be seen in the table below.

MSN YYYYMM Value Column_Order Description Unit
CLRCBUS 194913 1271.551 1 Coal Consumed by the Residential Sector Trillion Btu
CLRCBUS 195013 1261.267 1 Coal Consumed by the Residential Sector Trillion Btu
CLRCBUS 195113 1158.679 1 Coal Consumed by the Residential Sector Trillion Btu
CLRCBUS 195213 1079.206 1 Coal Consumed by the Residential Sector Trillion Btu
CLRCBUS 195313 965.664 1 Coal Consumed by the Residential Sector Trillion Btu

In particular:

Data cleaning

Before processing the data, we extract the abbreviations (MSN) and descriptions(Description), and we divide them into two data frames, one containing all individual energy consumption per source and one containing the total values. Those will be later used for the visualizations. The two tables can be seen below:

##Extract codes and names
code<-data.frame(Detail=unique(orig_dat$MSN),Desc=unique(orig_dat$Description))

##Individual Values
ind<-code %>% 
  filter(!grepl('Total', Desc))

##Total Values 
tot<-code %>% 
  filter(grepl('Total', Desc))

Next, we look into the data type of each column.

str(orig_dat)
## spec_tbl_df [7,896 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ MSN         : chr [1:7896] "CLRCBUS" "CLRCBUS" "CLRCBUS" "CLRCBUS" ...
##  $ YYYYMM      : num [1:7896] 194913 195013 195113 195213 195313 ...
##  $ Value       : chr [1:7896] "1271.551" "1261.267" "1158.679" "1079.206" ...
##  $ Column_Order: num [1:7896] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Description : chr [1:7896] "Coal Consumed by the Residential Sector" "Coal Consumed by the Residential Sector" "Coal Consumed by the Residential Sector" "Coal Consumed by the Residential Sector" ...
##  $ Unit        : chr [1:7896] "Trillion Btu" "Trillion Btu" "Trillion Btu" "Trillion Btu" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   MSN = col_character(),
##   ..   YYYYMM = col_double(),
##   ..   Value = col_character(),
##   ..   Column_Order = col_double(),
##   ..   Description = col_character(),
##   ..   Unit = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
unique(orig_dat$Unit)
## [1] "Trillion Btu"

We observe that:

dat<-data.frame(orig_dat)
#Drop columns
dat<-subset(orig_dat, select = -c(Unit,Description,Column_Order))
###Missing values
dat$Value[dat$Value=='Not Available']<-NA

The column containing the dates is set to the numeric data type. We convert it to characters, extract the year and month, and finally drop the original column. Next, we drop the rows containing data from 1949 to 1999. Finally, we observe that there is a month 13, which corresponds to the total energy consumption for each year.

##Convert to character
dat$YYYYMM<-as.character(dat$YYYYMM)
##Create new column containing the dates
dat <- dat %>%
  add_column(year = substr(dat$YYYYMM, start = 1, stop = 4),.after = "YYYYMM") 
dat <- dat %>%
  add_column(mon = substr(dat$YYYYMM, start = 5, stop = 6),.after = "year")  
##Drop original column
dat <- subset(dat, select = -YYYYMM )
## Drop rows with a date before january 2000
ec<-subset(dat,year>=2000)
## Drop rows with month 13 which corresponds to total
ec<-subset(ec,mon!=13)

Finally, we convert all columns to the correct data types.

#Combine dates in the right format
ec$year<-as.numeric(ec$year)
ec$mon<-as.numeric(ec$mon)
ec$Value<-as.numeric(ec$Value)
ec$Date <- as.yearmon(paste(ec$year, ec$mon), "%Y %m")
The final data frame can be seen below:
MSN year mon Value Date
CLRCBUS 2000 1 1.466 Jan 2000
CLRCBUS 2000 2 1.093 Feb 2000
CLRCBUS 2000 3 0.848 Mar 2000
CLRCBUS 2000 4 0.967 Apr 2000
CLRCBUS 2000 5 0.648 May 2000

Note: The last column may appear redundant, since we already have the year and month in the previous columns. We keep it as the format is easier for visualisation purposes.

Missing Values

There are 166 missing values, as seen in the figure and table below. Those values are only missing from the data containing coal energy consumption. To make the visualization process easier, we dropped such values rather of using them.

#Find number a missing values per group
gg_miss_fct(ec, MSN) 

ec %>% group_by(MSN) %>% summarise(
  non_na = sum(!is.na(Value)),na = sum(is.na(Value)))
## # A tibble: 12 × 3
##    MSN     non_na    na
##    <chr>    <int> <int>
##  1 BMRCBUS    262     0
##  2 CLRCBUS     96   166
##  3 ESRCBUS    262     0
##  4 FFRCBUS    262     0
##  5 GERCBUS    262     0
##  6 LORCBUS    262     0
##  7 NNRCBUS    262     0
##  8 PARCBUS    262     0
##  9 RERCBUS    262     0
## 10 SORCBUS    262     0
## 11 TERCBUS    262     0
## 12 TXRCBUS    262     0
#Delete NAs and get final data frame
f.data <- na.omit(ec) 

Visualisations

Residential Sector Energy Consumption per source

Let’s start by visualising the energy consumption for each energy source. From the figure above it’s very hard to identify any clear trends. We can observe that natural gas is the most used energy source but there are big fluctuations between the winter and summer months. We can also see that over the years the electricity sales has a slight upward trend but the volatility stays the same. We’re going to visiualise individually the consumption for fossil fuels and renewable energy.

##Individual sources: Pie charts and Line graph Next we’re going to create a pie chart to get a better idea of the proportion of energy consumed per source. Finally, we’re going to visualise the mean energy consumed per individual energy source.

## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

##Summary statistics and box plots per individual source

##       Date         BMRCBUS         CLRCBUS          ESRCBUS     
##  Min.   :2000   Min.   :28.38   Min.   :0.3820   Min.   :260.1  
##  1st Qu.:2002   1st Qu.:32.26   1st Qu.:0.6240   1st Qu.:315.7  
##  Median :2004   Median :33.61   Median :0.7730   Median :361.1  
##  Mean   :2004   Mean   :33.44   Mean   :0.8529   Mean   :367.1  
##  3rd Qu.:2006   3rd Qu.:35.34   3rd Qu.:1.0283   3rd Qu.:410.3  
##  Max.   :2008   Max.   :36.52   Max.   :1.7910   Max.   :512.1  
##     GERCBUS         LORCBUS          NNRCBUS          PARCBUS      
##  Min.   :0.681   Min.   : 604.5   Min.   : 111.1   Min.   : 54.93  
##  1st Qu.:0.803   1st Qu.: 691.3   1st Qu.: 142.9   1st Qu.: 82.13  
##  Median :1.107   Median : 769.5   Median : 312.4   Median :107.41  
##  Mean   :1.161   Mean   : 804.0   Mean   : 410.9   Mean   :120.08  
##  3rd Qu.:1.363   3rd Qu.: 892.7   3rd Qu.: 692.9   3rd Qu.:156.50  
##  Max.   :1.868   Max.   :1129.9   Max.   :1001.5   Max.   :226.45  
##     SORCBUS     
##  Min.   :2.718  
##  1st Qu.:3.648  
##  Median :4.649  
##  Mean   :4.446  
##  3rd Qu.:5.190  
##  Max.   :5.984
## Warning in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm =
## na.rm): NAs introduced by coercion
##        Date     BMRCBUS     CLRCBUS     ESRCBUS     GERCBUS     LORCBUS 
##          NA   1.9651206   0.3333607  61.7581942   0.3657290 136.0019233 
##     NNRCBUS     PARCBUS     SORCBUS 
## 286.4640758  45.3295816   0.9139452
## # A tibble: 1,930 × 3
## # Groups:   MSN [8]
##    MSN     Value Date     
##    <chr>   <dbl> <yearmon>
##  1 CLRCBUS 1.47  Jan 2000 
##  2 CLRCBUS 1.09  Feb 2000 
##  3 CLRCBUS 0.848 Mar 2000 
##  4 CLRCBUS 0.967 Apr 2000 
##  5 CLRCBUS 0.648 May 2000 
##  6 CLRCBUS 0.656 Jun 2000 
##  7 CLRCBUS 0.792 Jul 2000 
##  8 CLRCBUS 0.81  Aug 2000 
##  9 CLRCBUS 0.67  Sep 2000 
## 10 CLRCBUS 0.531 Oct 2000 
## # … with 1,920 more rows

Total Energy consumption

We are going to calculate the summary data for the total consumption and sales.The graphs below show the total energy consumed, the energy consumed in the recent years and the proportion of each one.

##     FFRCBUS          RERCBUS         TERCBUS        TXRCBUS      
##  Min.   : 134.5   Min.   :32.33   Min.   :1271   Min.   : 192.3  
##  1st Qu.: 202.5   1st Qu.:40.47   1st Qu.:1434   1st Qu.: 247.6  
##  Median : 354.4   Median :49.62   Median :1634   Median : 403.6  
##  Mean   : 503.3   Mean   :50.58   Mean   :1743   Mean   : 553.9  
##  3rd Qu.: 813.7   3rd Qu.:58.68   3rd Qu.:1972   3rd Qu.: 863.6  
##  Max.   :1224.4   Max.   :76.88   Max.   :2808   Max.   :1272.0

Univariate data

For the univariate EDA analysis we choose to study the natural gas consumption. We create a time series with the values that we subseted from the dataset starting from 2000 with monthly frequency.

MSN year mon Value Date
NNRCBUS 2000 1 881.954 Jan 2000
NNRCBUS 2000 2 792.410 Feb 2000
NNRCBUS 2000 3 562.305 Mar 2000
NNRCBUS 2000 4 409.652 Apr 2000
NNRCBUS 2000 5 233.428 May 2000

Visualisation

Time series Analysis-Autocorrelation and Partial Correlation

In this figure, it can be observed that the ACF is sinusoidal, which implies that the ACF is periodic. The peak recurs every 12 implies that the ACF has a period of 12 lags which when calculated is equivalent to a year. The ACF is not approaching zero and thus it is most probably not stationary.It can also be observed that the majority of the ACF values are not within the bounds of statistical significance.

Trend

We observe that there is a slight upward trend.

Seosonality

Multivariate Data

For the multivariate analysis we’re choosing to study the total fossil and renewable energy consumption.For this purpose we create a new dataframe containing the variables of interest and then we reshape it in order to make our analysis easier.

From the below figure we observe that there is a big difference between the fossil and the renewable energy consumption. In the sections below we will have a closer look at each one.

Scatter plot

ggplot(data = ec.mult_data, aes(x = Ren, y = Fos))+
  geom_point() + 
  geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'

data.frame(fos.rank = rank(ec.mult_data$Fos), ren.rank = rank(ec.mult_data$Ren)) %>%
  ggplot(., aes(x = ren.rank, y = fos.rank))+
  geom_point() + 
  geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'

cor(ec.mult_data$Fos, ec.mult_data$Ren, method = "pearson")
## [1] -0.2306415

ACF and PACF

Fossil energy consumption

We take a closer look at the fossil energy consumption.

Seasonality

There is a period that’s equal to 12 months.

## [1]  0.08518519 11.73913043
## [1] 12

Total renewable energy consumption

Trend

Volatility

Seasonality

## [1]  0.01851852 54.00000000